Analyzing Data: Conditional Means

The conditional mean will be our first predictive algorithm. Conditional means answer the question: “Given what we know about a certain case, what can we expect to see, on average?” The conditional mean is a powerful tool that is typically quite easy to explain to decision-makers.

We’ll go through the following steps:

  1. Computing and plotting unconditional means.
  2. Computing and plotting conditional means using a single predictor.
  3. Computing and plotting conditional means using multiple predictors.

Dataset for this week

We will be working with a dataset put together by the Census Bureau that summarizes the characteristics of the 3,088 counties in the United States.

load("pd.Rdata")

The codebook for this dataset is stored as another dataset, lab_explain. The first column in this dataset contains variable names; the second column gives a full explanation of each variable.

## Full explanation of data in codebook
load("pd_lab_explain.Rdata")

#or use View
#View(lab_explain)

Filter and select

Examine Davidson County, TN only. Take the dataset and pipe it to filter, keeping only the county named Davidson County. From that result, select the county name and per capita income in 2012.

Then, examine what per capita income looks like in counties with different levels of education.

Take the data set and pipe it to filter, keeping counties where the college graduate percentage is greater than 30%. From that result, select the county name, college graduate percentage, and per capita income in 2012. Arrange in descending order of per capita income.

pd%>%filter(county=="Davidson County, TN")%>%
  select(county,percapinc.2012)
## # A tibble: 1 x 2
##   county              percapinc.2012
##   <chr>                        <dbl>
## 1 Davidson County, TN          51526
pd%>%filter(coll_grad_pc>30)%>%
  select(county,coll_grad_pc,percapinc.2012)%>%
  arrange(-percapinc.2012)
## # A tibble: 339 x 3
##    county                   coll_grad_pc percapinc.2012
##    <chr>                           <dbl>          <dbl>
##  1 New York County, NY              58.1         119347
##  2 Marin County, CA                 54.6          93407
##  3 Teton County, WY                 49            93194
##  4 Nantucket County, MA             43.6          83634
##  5 Arlington County, VA             71.2          83242
##  6 Haines Borough, AK               33            82323
##  7 Fairfield County, CT             44.6          81068
##  8 Alexandria city, VA              60.5          80952
##  9 Pitkin County, CO                55.3          80066
## 10 San Francisco County, CA         52            80014
## # … with 329 more rows

Among counties whose college graduate percentage is greater than 30%, the county with the highest income is New York County, New York, followed by Marin County in California.

Quick Exercise: What’s per capita income in Los Angeles County, CA in 2012?

pd%>%filter(county=="Los Angeles County, CA")%>%select(county,percapinc.2012)
## # A tibble: 1 x 2
##   county                 percapinc.2012
##   <chr>                           <dbl>
## 1 Los Angeles County, CA          44474

Answer: 44474

Dependent Variable

Our working example will be based on predicting income in a given county. Suppose we want to know what income level we can expect for a geographic area based on observed characteristics, such as the proportion of the population with a bachelor’s degree. How would we predict the income based on what we know about the geographic area?

Let’s begin by plotting the data to see what it looks like. To do this I need to first rank the counties by income. To create a rank variable that will be stored in the pd dataset, I use the mutate command. This creates a variable based on some calculation then stores it in the same dataset. I’m then going to plot incomes for each county in descending rank order. Using the plotly library I can make this interactive so we know which counties we’re talking about.

This code creates a new variable called percapinc_rank, which ranks all counties on the basis of their income.

The pd dataset is defined as the pd dataset piped to a new command, mutate. The mutate command takes as its argument the name of the new variable, percapinc_rank, which is set equal to the rank of per capita income in 2010.

Question: Does rank() sort in ascending or descending order?

## Create a rank variable for income 
pd<-pd%>%mutate(percapinc_rank=rank(percapinc.2010))
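A quick check on a toy vector answers the question above: rank() assigns 1 to the smallest value, so it ranks in ascending order.

```r
## rank() gives the smallest value rank 1, so ranking is ascending
rank(c(30, 10, 20))
## [1] 3 1 2
```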

The next code will create a graphic for us. We will be using `ggplot` to make all of our graphics in this class. The first step in ggplot is to create a graphics object. We could name it anything; I’m just going to name it gg. Within this object, I declare the data (pd), the x variable (percapinc_rank), and the y variable (percapinc.2010). I also note that I want to use the county name (county) as text.

ggplot requires two arguments. In the first, the data argument, we tell it to use the pd data set. In the second, the aesthetic argument, we tell it what the x and y variables are. The text is the county name.

## Plot by rank
gg<-ggplot(data=pd, aes(x=percapinc_rank,
                         y=percapinc.2010,
                         text=county))

Now I need to declare the type of graphic, or geometry. By specifying geom_point I’m saying I want a scatterplot.

## Add Points
gg<-gg+geom_point(alpha=.5,size=.5)

Now I’m going to add labels for the x and y axis.

## Add labels
gg<-gg+xlab("Rank")+ylab("Per Capita Income, 2010")

And now we’re ready to call the graphics object, gg

gg

I’m going to store this gg object in another object called gg1. This is so I can come back to it later.

gg1<-gg

An interactive plot makes it possible to put our mouse cursor over a particular point and see which county it corresponds to.

# Make Interactive plot
gg_p<-ggplotly(gg)

gg_p

Unconditional Means

If you were asked to predict the income for a given area without any additional information, the likely best guess is the overall average. We’re going to begin with the unconditional mean, or simple average, as our first prediction. We’ll again use the mutate command to plug in a variable that will be the average for every county, and we’ll plot this as a predictor.

Our notation for the unconditional mean as a predictor is:

\[\hat{Y}=\bar{Y} \]

What this says is that our predicted income \(\hat{Y}\) is equal to average income \(\bar{Y}\). We’ll always use \(Y\) as the notation for a dependent variable and \(X\) as the notation for an independent variable.

The summarize command makes a new variable, mean_percapinc.2010, equal to the mean of per capita income in 2010. The na.rm argument tells R that if there’s any missing data, it should simply be removed from the calculation. The mean per capita income in 2010 is about $34,000. This is our prediction, without knowing anything else.

Mutate adds a new variable to the dataset that’s equal to the mean of a particular variable. The assignment arrow saves the change back into the dataset.

##Unconditional Average
pd%>%summarize(mean_percapinc.2010=mean(percapinc.2010,na.rm=TRUE))
## # A tibble: 1 x 1
##   mean_percapinc.2010
##                 <dbl>
## 1              34003.
##Unconditional Average as a Predictor
pd<-pd%>%mutate(mean_percapinc.2010=mean(percapinc.2010,na.rm=TRUE))

##Plotting
gg<-ggplot(data=pd,aes(y=percapinc.2010,x=percapinc_rank,color="Actual"))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+geom_point(aes(y=mean_percapinc.2010,x=percapinc_rank,
                  color="Predicted: Unconditional Mean"),
                  size=.5)
gg<-gg+xlab("Rank of Per Capita Income")+ylab("Per Capita Income")
gg<-gg+scale_color_manual(name="Type",
                          values=c("Actual"="black",
                          "Predicted: Unconditional Mean"="blue")
                          )
gg<-gg+theme(legend.position="bottom")

gg

##Save for later

gg2<-gg

This is of course a terrible prediction. In the absence of any other information, it’s often the best we can do, but we really ought to be able to do better.

To understand how far off we are, we need to summarize our errors. We will use several different ways of doing this over the semester, but let’s start with a very standard one: Root Mean Squared Error, or RMSE. An error is the vertical distance between each point and its prediction. The RMSE is the square root of the mean of the squared errors. (Q: why do we square them?)

\[RMSE(\hat{Y})=\sqrt{ 1/n \sum_{i=1}^n(Y_i-\hat{Y_i})^2} \]
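The formula can be computed directly in base R. Here is a sketch on made-up numbers (not from pd): the prediction is the unconditional mean, and the RMSE is the square root of the mean of the squared errors.

```r
## RMSE by hand on toy data: sqrt of the mean of squared errors
y     <- c(30000, 42000, 25000)   # made-up actual incomes
y_hat <- rep(mean(y), length(y))  # unconditional-mean prediction
sqrt(mean((y - y_hat)^2))
```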

The error term for our prediction using unconditional means will be stored in the variable \(e1\). This variable will be equal to the actual value of per capita income percapinc.2010 minus the mean value of per capita income mean_percapinc.2010.

pd<-pd%>%mutate(e1=percapinc.2010-mean_percapinc.2010)

To calculate the root mean squared error, we use the rmse function from the Metrics library. It takes two arguments, the actual values and then the predicted values, each supplied as dataset$variable. The code below calculates and displays the RMSE.

## RMSE

rmse_uncond_mean<-rmse(pd$percapinc.2010,pd$mean_percapinc.2010)

rmse_uncond_mean
## [1] 7817.037

Output: 7817.037. On average, we would be off by about $7,817 across the whole dataset, which is a lot!

Conditional Means With One Predictor Variable

To incorporate additional information into the mean, we need to calculate averages at different levels of other predictors. Let’s calculate the average per capita income at different levels of college education. The code below will calculate average income across counties at four different levels of college education: the four quartiles of college education in the dataset.

We use the mutate command to create a new variable, coll_grad_level, that splits coll_grad_pc into four quartiles, answering the question: “What quartile is this county in, in terms of the percent of college graduates?” The ntile command divides the data into however many groups we ask for; here we tell R to divide coll_grad_pc into four groups.
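The behavior of ntile is easy to check on a toy vector (this sketch assumes the dplyr package, which the rest of the code here already relies on): values are ranked and then split into equal-sized groups.

```r
library(dplyr)

## Split four values into two groups: the two smallest get group 1
ntile(c(5, 1, 9, 3), 2)
## [1] 2 1 2 1
```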

##Conditional Average across a single variable

## Create a variable for quartiles of college education
pd<-pd%>%mutate(coll_grad_level=ntile(coll_grad_pc,4))

table(pd$coll_grad_level)
## 
##   1   2   3   4 
## 772 772 772 772
summary(pd$coll_grad_pc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.70   13.50   17.20   19.35   22.90   71.20
##The commands will be performed on each level of college grad education
pd<-pd%>%group_by(coll_grad_level)%>% ## Group by predictor
  ##Calculate mean at each level of predictor. Calculate the average income at each level of college education. The mutate command creates a new variable, pred_income_college that is the average income, grouped by college level.
  mutate(pred_income_college=mean(percapinc.2010))%>% 
  ## Ungroup
  ungroup()%>% 
  #Rank by prediction high to low, with ties sorted randomly
  mutate(pred_income_college_rank=rank(pred_income_college,ties.method="random"))

To visualize this we can use a similar graphic as before:

pd%>%group_by(coll_grad_level)%>% ## Group by predictor
  ##Calculate mean at each level of predictor
  summarise(pred_income_college=mean(percapinc.2010))
## # A tibble: 4 x 2
##   coll_grad_level pred_income_college
##             <int>               <dbl>
## 1               1              28056.
## 2               2              32371.
## 3               3              35292.
## 4               4              40293.
gg<-ggplot(data=pd,aes(x=pred_income_college_rank,y=percapinc.2010,color="Actual"))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+geom_point(aes(x=pred_income_college_rank,y=pred_income_college,color="Predicted:Conditional Mean, 1 var"))
gg<-gg+ scale_color_manual("Type",values=c("Predicted:Conditional Mean, 1 var"="red","Actual"="black"))
gg<-gg+theme(legend.position="bottom")
gg<-gg+xlab("Rank")+ylab("Per Capita Income, 2010")
gg

##Save for later
gg3<-gg

Our notation for this predictor: \(\hat{Y}\) is equal to the average of \(Y\), “given” that our grouping variable \(X\) takes the value \(x\); for example, given that a county’s college education level is in the first quartile. \[\hat{Y}=(\bar{Y}|X=x) \]

That is, the predicted value of \(Y\), \(\hat{Y}\), is equal to the mean value of \(Y\) given that our predictor \(X\) (college graduate levels in this case) is equal to a given value of \(X\), denoted by \(x\).
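In base R terms, a conditional mean is just a group-wise average. A minimal sketch with made-up numbers:

```r
## Conditional mean: the average of y within each level of x
y <- c(20, 22, 40, 44)  # made-up outcome values
x <- c(1, 1, 2, 2)      # made-up grouping variable
tapply(y, x, mean)
##  1  2 
## 21 42
```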

Let’s see what happened to our RMSE when we did a conditional as opposed to an unconditional mean.

rmse_cond_mean_one<-rmse(pd$percapinc.2010,pd$pred_income_college)
rmse_cond_mean_one
## [1] 6425.743

The new RMSE has improved, from about 7,817 to 6,426. We’ve made progress towards a better prediction, which is good: a lower RMSE means that our predictions are more accurate. If we wanted to predict the level of income for a given county, the educational level is a good indicator.

Quick Exercise: Calculate per capita income as a function of the proportion of the county with a high school education.

We can continue “binning” the data to define averages by more and more subgroups. For instance, we might want to calculate the average income by education AND home ownership rate.

New Variable Home Ownership Rate

## Create a variable for quartiles of home ownership
pd<-pd%>%mutate(homeown_rate_level=ntile(homeown_rate,4))
pd%>%group_by(homeown_rate_level)%>% ## Group by predictor
  ##Calculate mean at each level of predictor
  summarise(pred_income_homeown_rate=mean(percapinc.2010))
## # A tibble: 4 x 2
##   homeown_rate_level pred_income_homeown_rate
##                <int>                    <dbl>
## 1                  1                   34925.
## 2                  2                   33245.
## 3                  3                   33528.
## 4                  4                   34314.
pd<-pd%>%group_by(coll_grad_level,homeown_rate_level)%>% ## Group by predictor
  ##Calculate mean at each level of predictor
  mutate(pred_income_coll_and_homeown=mean(percapinc.2010))%>% 
  ## Ungroup
  ungroup()%>% 
  #Rank by prediction, with ties sorted randomly
  mutate(pred_income_coll_and_homeown_rank=rank(pred_income_coll_and_homeown,
                                                ties.method="random"))
rmse_cond_mean_two<-rmse(pd$percapinc.2010,pd$pred_income_coll_and_homeown)
rmse_cond_mean_two
## [1] 6371.481

Root mean squared error: the error is the actual data minus the prediction, and larger errors mean less predictive accuracy. We can quantify this using RMSE.